Abstract

This research article presents a comprehensive study on the pre-training of two 
language models, MorRoBERTa and MorrBERT, for the Moroccan Dialect, using 
the Masked Language Modeling (MLM) pre-training approach. The study details 
the various data collection and pre-processing steps involved in building a large 
corpus of over six million sentences and 71 million tokens, sourced from social media 
platforms such as Facebook, Twitter, and YouTube. The pre-training process was 
carried out using the HuggingFace Transformers API, and the paper elaborates on 
the configurations and training methodologies of the models. The study concludes 
by demonstrating the high accuracy rates achieved by both MorRoBERTa and 
MorrBERT in multiple downstream tasks, indicating their potential effectiveness 
in natural language processing applications specific to the Moroccan Dialect. 
1 Introduction 


Natural Language Processing (NLP) has become a 
rapidly growing research field in recent years due to the 
emergence of deep learning models such as the Trans- 
former [27] architecture. This neural network model 
was first introduced by [27] in 2017 and has revolution- 
ized the field of NLP by providing a powerful way to 
process long sequences of text using attention mech- 
anisms. The Transformer architecture has surpassed 
older models such as LSTMs (Long short-term mem- 
ory networks) [17] and GRUs (Gated recurrent units) 
[10], and has served as a foundational architecture for 
a range of subsequent models that have achieved re- 
markable performance in various NLP tasks, including 
question-answering, language modeling, text classifica- 
tion, and others. These models include BERT (Bidi- 
rectional Encoder Representations from Transformers) 
[13], XLNet [31], and RoBERTa (Robustly Optimized 
BERT Pretraining Approach) [21], among others. 
One of the key benefits of the Transformer architec- 
ture is that once pre-trained on a large corpus of text, 
the models can be fine-tuned for specific tasks using 
smaller, task-specific datasets. This approach has led 
to significant improvements in a range of NLP tasks, es- 
pecially when large, well-curated training datasets are 
available. Despite the impressive progress in NLP en- 
abled by the Transformer architecture, most of the pre- 
trained models are based on English or other widely 
spoken languages, and there is limited availability of 
models for other languages [22]. Moreover, most of 
the existing pre-trained models are multilingual, which 
means that they are trained on a collection of languages 
but do not account for the nuances and peculiarities of 
individual languages. 


This lack of language-specific models is especially 
pronounced in the case of the Moroccan dialect, which 
has low resources and has received little attention from 
NLP researchers. The Moroccan dialect is spoken by 
more than 36 million people and is the primary commu- 
nication tool in everyday life, while Modern Standard 
Arabic (MSA) is the official language used in Morocco 
[2]. The Moroccan dialect is also commonly used on so- 
cial media platforms to express opinions and thoughts. 


However, the Moroccan dialect poses significant 
challenges for NLP due to several factors. For one, 
the Moroccan dialect can be written using different 
scripts, including the Arabic alphabet, the Latin al- 
phabet, or a combination of the two [23]. Additionally, 
the Moroccan dialect has its own unique vocabulary 
and syntax that differ from both MSA and other languages [26]. 
These factors make the Moroccan dialect a challenging 
target for NLP and motivate the development of 
language-specific pre-trained models. To address this 
need, we propose to pre-train two BERT-like models 
for the Moroccan dialect. Specifically, we base the 
architecture of our models on the RoBERTa model and 
the original BERT model, which have achieved 
state-of-the-art performance in NLP tasks. 


To pre-train our models, we use a dataset consisting 
of over six million sequences of YouTube comments, 
Facebook comments, and tweets. We then compare the 
performance of our models with existing Moroccan di- 
alect models and multilingual models, including multi- 
lingual BERT [13], XLM-R [11], CamelBERT [18], and 
MARBERT [5] using both publicly available datasets 
for the Moroccan dialect and our own manually 
annotated data. Our primary contributions in this work 
are two-fold. First, we present two pre-trained 
BERT-like models, MorRoBERTa and MorrBERT, specifically 
designed for the Moroccan dialect. These models are 
pre-trained on a large dataset of more than six million 
sequences consisting of comments from social media 
platforms such as YouTube, Facebook, and Twitter. 

We use the pre-trained models to perform down- 
stream natural language understanding tasks, such as 
sentiment analysis, dialect identification, and language 
classification as used by Moroccans. Our results show 
that both MorRoBERTa and MorrBERT are effective 
for sentiment analysis, dialect identification, and lan- 
guage classification tasks on Moroccan datasets, and 
can achieve comparable or better performance than 
other state-of-the-art models in the field. Second, we 
compare the performance of our models to that of ex- 
isting Moroccan dialect and multilingual models using 
varying data sizes and varying classes. Our 
experiments show that MorRoBERTa and MorrBERT are 
competitive with, and on several tasks outperform, other 
models, indicating their effectiveness in handling the 
unique challenges posed by the 
Moroccan dialect. 

Overall, our work provides a valuable contribution 
to the field of natural language processing, especially 
for low-resource languages like the Moroccan dialect. 
Our pre-trained models can be fine-tuned for specific 
tasks such as sentiment analysis, named entity recog- 
nition, and machine translation, among others. Fur- 
thermore, our experiments highlight the importance 
of developing language-specific models rather than re- 
lying on multilingual models, especially for languages 
with distinct characteristics and limited resources. Our 
work opens up avenues for future research in the de- 
velopment of language-specific models for other low- 
resource languages. 


2 Related Work 


The development of pre-trained language models based 
on the Transformer architecture has revolutionized the 
field of NLP in recent years. One of the most successful 
examples is the BERT model proposed by [13], which 
employs a multi-layer Transformer-encoder architec- 
ture and two pre-training tasks, MLM, and Next Sen- 
tence Prediction (NSP), to learn contextualized word 
representations. BERT has achieved state-of-the-art 
performance in a wide range of NLP tasks, includ- 
ing sentiment analysis, question answering, and nat- 
ural language inference. In response to the success of 
BERT, many BERT-based models with varying goals 
have been proposed in the literature. For instance, 
RoBERTa [21] optimizes BERT’s pre-training meth- 
ods by removing the NSP pre-training task and signif- 
icantly increasing the size of the training corpus and 
batch size. DistilBERT [24] and ALBERT [20] 
reduce the model size and complexity of BERT by 
using knowledge distillation and parameter sharing 
techniques, respectively. 


Apart from these models, several pre-trained mod- 
els have been developed specifically for low-resource 
languages, including XLM [12], which pre-trains a 
shared multilingual model on parallel data from 100 
languages, and mBERT [13], which pre-trains a sin- 
gle model on a large corpus of Wikipedia articles from 
104 languages. These models have been shown to im- 
prove the performance of downstream NLP tasks in 
low-resource languages, although they may not capture 
the nuances and peculiarities of individual languages. 

One of the major challenges in developing NLP mod- 
els for Arabic dialects is the lack of high-quality, large- 
scale labeled datasets for training and evaluation. This 
challenge is particularly relevant for the Moroccan di- 
alect, which has unique phonetic and grammatical fea- 
tures that distinguish it from other Arabic dialects. 
To overcome this challenge, researchers have proposed 
various pre-trained models that cover Arabic dialects, 
including Moroccan, such as CamelBERT [18], MARBERT [5], 
QARiB [4] and AraBERT [7], which are based on the BERT 
architecture and pre-trained on multidialectal corpora. 
However, these models have limitations due to the rel- 
atively low representation of the Moroccan dialect in 
the training corpus, which can result in reduced perfor- 
mance and accuracy when applied to specific dialectal 
variations. 

In contrast, DarijaBERT, the monodialectal Moroc- 
can model developed by [14], was trained specifically 
on a Moroccan Arabic dialect corpus, which allows for 
more accurate modeling of the linguistic features and 
nuances of the Moroccan dialect. The model also sup- 
ports both Arabic and Latin character representations 
of the dialect. 

In this work, we propose to pre-train two BERT-like 
models, MorRoBERTa and MorrBERT, specifically 
designed for the Moroccan dialect. We use a 
large corpus of social media data to pre-train our 
models using MLM and compare their performance with 
existing Moroccan dialect and multilingual models, on 
both public datasets and our own manually annotated 
data. Our experiments demonstrate that MorRoBERTa and 
MorrBERT perform comparably to or better than other 
models in downstream NLP tasks for the Moroccan 
dialect and achieve state-of-the-art results on several 
of them, highlighting the importance of developing 
language-specific models for low-resource languages 
like the Moroccan dialect. 


3 Pre-training Process 
3.1 Data Description 


In this section, we describe how we collected a large 
corpus for the Moroccan Dialect. 


Data Collection 


To train BERT-like models, it is necessary to amass 
significant amounts of raw text data and preprocess it 
accordingly. For our Moroccan Dialect project, we 
identified three sources from which to compile large datasets: 
Facebook, Twitter, and YouTube. Consequently, our 
pre-training corpus primarily consists of informal 
comments of the kind social media users write to 
express their opinions. The aim of collecting this data 
is to capture as much as possible of the vocabulary and 
text used on social media. We were able to gather over 
six million sentences written and shared by Moroccan 
users, resulting in a final dataset that comprised more 
than 71 million tokens, equivalent to approximately 
700 MB of text data. 

Table 1: Dataset used to train our models. 

Source      Number of Words   Number of Sentences   Size 
Facebook    26,559,210        3,191,281             144 MB 
YouTube     43,901,655        2,785,302             347 MB 
Twitter     869,184           51,479                6.61 MB 

We used different collection approaches for each 
source, as described below: 


• Twitter: We used the Twitter API to collect data 
  for research by searching for Moroccan-related 
  hashtags such as #Maroc, #bdarija, #dahk, etc. 
  and extracting all tweets with a Moroccan location 
  setting. We collected around 1.1 million tweets, and 
  after filtering and removing duplicates, we ended up 
  with 65,000 tweets. 

• Facebook: To collect data from Facebook, we 
  selected pages with a large number of members 
  and publications related to Moroccan contexts and 
  users, and we used the “Facebook Graph API” 
  to import the data. We collected nearly 5 million 
  comments between 11/09/18 and 18/11/18. 
  The total number of comments collected is equal 
  to 157.7M. We cleaned the comments by deleting 
  duplicates and empty ones. 

• YouTube: We utilized the YouTube API version 
  3.0 to extract comments from different channels. 
  We selected 33 channels based on their popularity, 
  identified by Hypeauditor [3] (most commented 
  and most followed), and used SocialReaper¹ to scrape 
  the comments. In total, we collected 2,785,302 
  comments from 27,000 different videos. 
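
The cleaning step mentioned for the collected comments (deleting duplicates and empty ones) can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name is hypothetical, and it assumes that duplicates are exact matches after trimming surrounding whitespace.

```python
def drop_duplicates_and_empty(comments):
    """Keep the first occurrence of each non-empty comment, preserving order.

    Assumption: two comments are duplicates when they are identical after
    stripping leading and trailing whitespace.
    """
    seen = set()
    kept = []
    for comment in comments:
        text = comment.strip()
        if not text or text in seen:
            continue  # skip empty comments and exact repeats
        seen.add(text)
        kept.append(text)
    return kept


raw = ["salam", "", "salam ", "   ", "labas 3lik"]
print(drop_duplicates_and_empty(raw))
```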


Table 1 contains statistical information for each 
source that was utilized to construct our dataset. 
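
As a quick consistency check, the per-source figures in Table 1 can be summed to obtain the overall corpus size. The snippet below only restates the numbers from Table 1; the dictionary layout is an illustrative choice.

```python
# Word and sentence counts per source, copied from Table 1.
table_1 = {
    "Facebook": {"words": 26_559_210, "sentences": 3_191_281},
    "YouTube": {"words": 43_901_655, "sentences": 2_785_302},
    "Twitter": {"words": 869_184, "sentences": 51_479},
}

total_words = sum(row["words"] for row in table_1.values())
total_sentences = sum(row["sentences"] for row in table_1.values())

print(f"total words:     {total_words:,}")
print(f"total sentences: {total_sentences:,}")
```

The totals come to roughly 71.3 million words and just over six million sentences.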


Pre-processing 


In the realm of data analysis and machine learning, 
the preprocessing steps applied to sequences are crucial 
in ensuring accurate and consistent results [15]. In an 
effort to standardize the data, the dataset underwent 
a series of modifications. Firstly, any hashtags, URLs, 
and email addresses were removed from the sequences 
in order to eliminate any noise or distractions from the 
data. 


¹ https://github.com/ScriptSmith/socialreaper 


[Figure 1 flowchart: Raw Corpus → Remove hashtags, URLs and email addresses → Remove non-Latin and non-Arabic characters → Delete diacritical marks → Remove the repetition of a letter in a word → Delete sequences containing less than two words → Pre-processed Corpus] 

Figure 1: Data preprocessing process 


Additionally, repeated letters within a word were 
collapsed, as such repetition is often indicative of 
emphasis in social media communication rather than a 
significant feature of the text. Diacritical marks were 
also removed to ensure consistency and prevent potential 
misinterpretations of the data. 

Furthermore, non-Latin and non-Arabic characters 
were removed from the sequences to create a more uni- 
form dataset. This was done in order to reduce the 
variability of the data and allow for a more accurate 
comparison between different sequences. 

Finally, only sequences with at least two words were 
retained. This was done in order to eliminate any 
overly simplistic or ambiguous sequences, which could 
potentially skew the results. By applying these prepro- 
cessing steps, the dataset was standardized and pre- 
pared for further analysis and machine learning appli- 
cations. 

The preprocessing process is shown in Fig. 1. 
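
The preprocessing steps described above can be sketched in Python as follows. This is an illustrative approximation, not the authors' code: the regular expressions, the choice to keep digits, the collapsing of runs of three or more identical letters into one, and the function names are all assumptions. Latin accents are stripped before the character filter because the filter below keeps unaccented Latin letters only.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HASHTAG_RE = re.compile(r"#\S+")
# Arabic diacritics (tashkeel) plus the superscript alef.
ARABIC_DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# Assumed character set: basic Latin letters, Arabic letters, digits, whitespace.
NOT_ALLOWED_RE = re.compile(r"[^A-Za-z0-9\u0621-\u064A\s]")
# A run of three or more identical letters (assumed definition of "repetition").
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")


def strip_latin_accents(text):
    """Replace accented Latin letters by their base letter (e.g. 'é' -> 'e')."""
    out = []
    for ch in text:
        base = unicodedata.normalize("NFD", ch)[0]
        out.append(base if base.isascii() else ch)
    return "".join(out)


def clean_sequence(text):
    """Return the cleaned sequence, or None if fewer than two words remain."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = strip_latin_accents(text)
    text = ARABIC_DIACRITICS_RE.sub("", text)
    text = NOT_ALLOWED_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)
    text = " ".join(text.split())
    return text if len(text.split()) >= 2 else None


print(clean_sequence("bzaaaaf zwiiin http://x.com #dahk"))
```

Collapsing only runs of three or more letters is a conservative choice that leaves legitimate double letters intact; the paper does not state which threshold was used.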


Tokenization 


Each model employs a tokenizer that adheres to its 
original research paper to maintain conformity. For 
the MorRoBERTa model, a byte-level BPE tokenizer is 
used [25]. This tokenizer merges byte-based characters 
based on their frequency of occurrence until the desired 
vocabulary size is achieved. 

In contrast, the MorrBERT model employs a WordPiece 
tokenizer [30], which is similar to the BPE tokenizer. 
WordPiece also builds its vocabulary by iteratively 
merging subword units, but it selects the merge that 
most increases the likelihood of the training data 
rather than simply the most frequent pair. For further 
details on the parameters, please refer to Table 2. 
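
To make the merge procedure concrete, the following toy sketch learns byte-level BPE merges from a small corpus. It is a simplified illustration under stated assumptions: words are split on whitespace, ties are broken deterministically, and the byte-to-character mapping and pre-tokenization rules of production tokenizers are omitted. The function name is hypothetical.

```python
from collections import Counter


def learn_byte_bpe(corpus, vocab_size):
    """Learn BPE merges over UTF-8 bytes until the vocabulary reaches vocab_size."""
    # Each word starts as a tuple of single-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for line in corpus
        for word in line.split()
    )
    vocab = {bytes([i]) for i in range(256)}  # the 256 base byte symbols
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the chosen merge to every word.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


print(learn_byte_bpe(["aa aa aa ab"], vocab_size=258))
```

In the models described here the target vocabulary size is 52K rather than the handful of merges shown in this toy run.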


3.2 Pre-training Models 


Our team developed two models, MorRoBERTa and 
MorrBERT, using the same corpus presented in the 
previous section. Both models were trained using MLM 
pre-training and a vocabulary size of 52K subword 
tokens. We utilized the HuggingFace Transformers API [29] 
to build the models, and they were executed on the 
HPC-MARWAN [1] platform using GPU cards for 
accelerated performance. 

Table 2: Configurations of the MorRoBERTa and MorrBERT models. 

                   MorRoBERTa       MorrBERT 
train_batch_size   128              64 
steps              565,980          565,980 
vocab_size         52K              52K 
Vocab Tokenizer    Byte-level BPE   WordPiece 
Model type         roberta          bert 
hidden_layers      6                12 
num_parameters     83,504,416       125,977,344 
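
For readers unfamiliar with the MLM objective, the sketch below shows how a training example is typically corrupted. The 15% selection rate and the 80/10/10 replacement scheme are the defaults of the original BERT recipe and are assumed here; the paper does not state the values used. The function name and the token ids in the usage example are hypothetical.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling.

    labels hold the original id at selected positions and -100 elsewhere,
    the value conventionally ignored by the loss.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


ids = [0, 57, 912, 4403, 87, 2]  # hypothetical ids; 0 and 2 are special tokens
print(mask_tokens(ids, vocab_size=52_000, mask_id=4, special_ids={0, 2},
                  rng=random.Random(0)))
```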

MorRoBERTa is a smaller version of the 
RoBERTa-base [21] model with 6 layers, 12 attention heads, 768 
hidden dimensions, and a maximum sequence length 
of 512. During training, the batch size was set to 128, 
and the model was trained for a total of 565,980 steps. 
The training process took nearly 92 hours to complete 
12 epochs across the full training set. 

Similarly, MorrBERT was configured exactly like the 
BERT-base [13] model, with 12 layers, 12 attention 
heads, and a batch size of 64. The model was also 
trained for a total of 565,980 steps, but the training 
process took nearly 120 hours to complete 12 epochs 
across the full training set. 

Table 2 provides a summary of the different config- 
urations for MorRoBERTa and MorrBERT, highlight- 
ing the varying specifications and training times re- 
quired for these models. Our findings underscore the 
importance of selecting the appropriate configuration 
and training methodology for neural models in natural 
language processing, and the computational resources 
required to achieve high accuracy and robustness in 
these applications. 
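
The parameter counts in Table 2 can be reproduced from the stated layer counts with a short calculation. The sketch relies on assumptions the paper does not spell out: a hidden size of 768 for both models, a feed-forward size of 3072, a vocabulary of exactly 52,000 entries, 514 position embeddings and one segment type for the RoBERTa-style model, and 512 position embeddings and two segment types for the BERT-style model. Under these assumptions the MorRoBERTa figure is consistent with counting the encoder plus a masked-LM head with tied output weights, and the MorrBERT figure with counting the encoder plus the pooler.

```python
def embedding_params(vocab, hidden, max_positions, type_vocab):
    # word + position + segment embeddings, followed by a LayerNorm
    return (vocab + max_positions + type_vocab) * hidden + 2 * hidden


def encoder_layer_params(hidden, intermediate):
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # one after attention, one after feed-forward
    return attention + feed_forward + layer_norms


def mlm_head_params(vocab, hidden):
    # dense layer + LayerNorm + output bias; decoder weights tied to embeddings
    return (hidden * hidden + hidden) + 2 * hidden + vocab


def pooler_params(hidden):
    return hidden * hidden + hidden


VOCAB, HIDDEN, FFN = 52_000, 768, 3072  # assumed values, see lead-in

morroberta = (embedding_params(VOCAB, HIDDEN, 514, 1)
              + 6 * encoder_layer_params(HIDDEN, FFN)
              + mlm_head_params(VOCAB, HIDDEN))
morrbert = (embedding_params(VOCAB, HIDDEN, 512, 2)
            + 12 * encoder_layer_params(HIDDEN, FFN)
            + pooler_params(HIDDEN))

print(f"MorRoBERTa: {morroberta:,}")
print(f"MorrBERT:   {morrbert:,}")
```

Roughly 40 million parameters in each model sit in the embedding matrix, which is why halving the number of layers does not halve the model size.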


4 Evaluation Tasks and Test Data 


In order to evaluate the effectiveness of our models, 
we rely on both publicly available datasets for the Mo- 
roccan dialect and our own manually annotated data. 
Our assessment focuses on three main tasks: sentiment 
analysis, dialect identification, and language classifica- 
tion as used by Moroccans. 

The Sentiment Analysis task, also referred to as Po- 
larity Detection, is a type of classification task that 
involves analyzing a given text and assigning it a sen- 
timent polarity label [8]. The primary goal of this task 
is to determine whether a piece of text expresses a pos- 
itive, negative, or neutral sentiment. To evaluate the 
effectiveness of sentiment analysis models, we utilized 
two Moroccan dialect datasets, namely MAC [16] and 
MSDA [9]. 

The MAC dataset consists of two classes, namely 
polarity (positive, negative, neutral, and mixed) and 
language of the tweets (Standard Arabic or Dialectal 
Arabic). On the other hand, the MSDA dataset comprises 
three classes: Arabic dialect (Algerian, Lebanese, 
Moroccan, Tunisian, and Egyptian), topic (other, politics, 
health, social, sport, and economics), and sentiment 
(positive, negative, and neutral). For more 
information about these datasets, please refer to Table 3. 

Arabic dialect identification involves recognizing and 
distinguishing the various spoken dialects of the Arabic 
language. As Arabic is spoken across several countries 
in the Middle East and North Africa, there are signif- 
icant differences in the way people pronounce words, 
use vocabulary, and apply grammar rules across di- 
alects. Arabic dialect identification can be done at 
different levels of detail, including binary (distinguish- 
ing between Modern Standard Arabic and dialects), 
regional (such as Gulf, Iraqi, Levantine, Egyptian, and 
North African dialects), and country-specific (for ex- 
ample, Moroccan, Algerian, Saudi dialects, etc.) [6]. 

To perform the dialect identification task, we use two 
distinct datasets. The first is MSDA [9], Arabic 
Dialect Detection for Social Media Posts, 
which is a labeled dataset with around 50K sentences. 
The second is IADD [32], which stands for Integrated 
Arabic Dialect Identification Dataset. This dataset 
consists of three categories: region (MGH, LEV, EGY, 
IRQ, GLF, or general), country (MAR, TUN, DZ, 
EGY, IRQ, SYR, JOR, PSE, LBN), and data source 
(PADIC, DART, AOC, SHAMI, or TSAC). To simplify 
the task, we transformed the IADD dataset into a bi- 
nary format where one label indicates the Moroccan 
dialect, while the other label represents all other di- 
alects. Table 4 provides additional details about these 
datasets. 
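
The binary relabelling of IADD described above can be sketched as follows. The record layout and field names are hypothetical, and treating entries without a country label as "other" is an assumption; the paper does not say how such entries were handled.

```python
def to_binary_label(country_code, moroccan_code="MAR"):
    """Map an IADD country label to 1 (Moroccan) or 0 (any other dialect)."""
    return 1 if country_code == moroccan_code else 0


# Hypothetical records using the country codes listed for IADD.
records = [
    {"text": "example 1", "country": "MAR"},
    {"text": "example 2", "country": "TUN"},
    {"text": "example 3", "country": None},  # no country label: assumed "other"
]

binary = [{"text": r["text"], "label": to_binary_label(r["country"])} for r in records]
print([r["label"] for r in binary])
```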

Language classification is the process of automati- 
cally identifying the language of a given text [28]. This 
task is an important part of natural language process- 
ing, as it is often necessary to know the language of 
a text in order to properly analyze or process it. In 
our language classification task, we utilized our own 
dataset of Facebook comments, which we manually an- 
notated with seven different labels, including Dialect 
in Latin Alphabet (DAL), Dialect in Arabic Alphabet 
(DAA), Classical Arabic (ARC), French (FRN), En- 
glish (ANG), Spanish (ESP), and Others (AUT). Ta- 
ble 5 provides a detailed description of the dataset, 
including the content and label descriptions. 


5 Results and Discussion 


We conducted experiments for each of the tasks 
described earlier by fine-tuning the MorrBERT and 
MorRoBERTa models, and comparing their performance 
with that of other models in the field, such as 
mBERT, XLM-R, CamelBERT-mix [18], MARBERT, 
and DBERT-mix (DarijaBERT-mix) [14]. The models’ 
performance was evaluated using the macro-averaged F1 
and accuracy measures. 
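
For reference, the two measures can be computed as in the sketch below, which uses the standard definitions of accuracy and macro-averaged F1 (the unweighted mean of per-class F1 scores). The function names and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because macro averaging weights every class equally, it penalises poor performance on rare labels more than accuracy does, which is relevant for imbalanced datasets such as those in Tables 3 to 5.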

Table 3: An overview of the evaluation datasets for the sentiment analysis task. 

Dataset   Size      Number of   Number of   Positive          Negative          Neutral   Domain 
                    Classes     Labels      polarity labels   polarity labels   labels 
MAC       1.71 MB   2           4           10,671            2,057             17,272    Twitter 
MSDA      6.53 MB   1           3           6,792             15,385            30,033    Twitter 

Table 4: An overview of the evaluation datasets for the dialect identification task. 

Dataset   Size      Number of   Number of   Moroccan   Total     Domain 
                    Classes     Labels      Dialect 
IADD      24.6 MB   3           10          7,213      135,804   Varied 
MSDA      6.62 MB   1           5           6,792      49,306    Twitter 

Table 5: An overview of the evaluation dataset for the language classification task. 

Size: 7.3 MB   Number of Labels: 7   Total: 75,509   Domain: Facebook 

Label   Description                                        Count 
DAL     Comment in Dialect Latin Alphabet                  28,143 
DAA     Comment in Dialect Arabic Alphabet                 27,012 
ARC     Comment in Classic Arabic                          13,324 
FRN     Comment in French                                  5,300 
ANG     Comment in English                                 1,172 
ESP     Comment in Spanish                                 251 
AUT     If a comment meets any of the following            30 
        criteria: other languages, only Facebook 
        usernames, mixed Arabic/Latin characters, 
        named entities, only numbers, or ambiguous 

All models were fine-tuned for four epochs using a 
batch size of 16 and the Adam optimizer [19]. The 
default values were maintained for other 
hyperparameters. We split the data into 80% for training 
and 20% for testing, using a random stratified split. We 
conducted these experiments using GPUs in Google 
Colaboratory². 
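
The random stratified 80/20 split described above can be sketched in plain Python as follows. This is an illustration only; the paper does not say which implementation was used (scikit-learn's `train_test_split` with its `stratify` argument is a common choice), and the function name, seed, and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-label proportions preserved."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # random assignment within each label
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["positive"] * 50 + ["negative"] * 30 + ["neutral"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying keeps rare labels represented in the test set, which matters for labels with very few examples such as ESP and AUT in Table 5.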

The experimental results, comparing the perfor- 
mance of MorrBert and MorRoBERTa models to other 
models across various downstream tasks, are presented 
in Tables 6, 7, and 8. 


5.1 Sentiment Analysis 


The results of the sentiment analysis task, in terms of 
accuracy and F1 scores on both the MAC and MSDA 
datasets, are displayed in Table 6. MorRoBERTa 
achieved F1 scores of 74.44% and 78.07% for 
the MAC and MSDA datasets, respectively, while 
MorrBERT achieved F1 scores of 75.13% and 76.50% for the 
same datasets. On MSDA, MorRoBERTa obtains the highest 
F1 score of all models; on MAC, both models outperform 
mBERT, XLM-R, and DBERT-mix but remain below 
CamelBERT-mix and MARBERT. 

Interestingly, we found that the DBERT-mix model, 
which is designed specifically for the Moroccan dialect 
of Arabic, did not perform as well as expected. This 
may suggest that models trained on more general vari- 
eties of Arabic are better suited for sentiment analysis 
tasks on these datasets. 


² https://colab.research.google.com/ 


5.2 Dialect Identification 


Table 7 reports the accuracy and F1 scores for dialect 
identification. On the MSDA dataset, MorRoBERTa (78.2% 
accuracy, 72.8% F1) scores higher than MorrBERT (75.1%, 
67.0%), while MARBERT and CamelBERT-mix obtain the best 
results. On the IADD dataset, all models exceed 98% 
accuracy, and MorRoBERTa achieves the highest scores in 
both measures (99.6% accuracy, 98.6% F1), ahead of 
MARBERT and CamelBERT-mix; MorrBERT (99.2%, 95.3%) 
ranks above DarijaBERT-mix, mBERT, and XLM-R. XLM-R 
obtains the lowest scores on both datasets. 


5.3 Moroccan Language Classification 


In the language classification task, the findings from 
Table 8 reveal that MorrBERT and MorRoBERTa 
outperformed other models in terms of both accuracy and 
F1 score. MorrBERT achieved an F1 score of 91.06%, 
while MorRoBERTa achieved an F1 score of 90.33%. 
Although DBERT-mix, MARBERT, CamelBERT-mix, mBERT, and 
XLM-R also demonstrated high accuracy and F1 scores, 
their performance was slightly lower compared to 
MorrBERT and MorRoBERTa. 

Overall, the experiments showed that fine-tuned 
MorrBERT and MorRoBERTa models are effective for 
sentiment analysis, dialect identification, and language 
classification tasks on Moroccan datasets, and can 
achieve comparable or better performance than other 
state-of-the-art models in the field. Additionally, 
models specifically designed for Arabic dialects such as 
MARBERT and CamelBERT-mix also show promising results. 

Table 6: Sentiment Analysis accuracy and F1-Macro 
scores on the 10k MAC and MSDA Datasets. 

                 MAC            MSDA 
Model            Acc.   F1      Acc.   F1 
CamelBERT-mix    83.8   80.1    85.2   77.9 
DBERT-mix        77.4   72.2    83.9   75.9 
MARBERT          85.0   82.2    84.8   77.6 
mBERT            74.1   67.8    83.2   75.0 
XLM-R            75.1   69.1    83.4   75.8 
MorrBERT         79.2   75.1    84.4   76.5 
MorRoBERTa       78.9   74.4    85.1   78.1 

Table 7: Dialect Identification accuracy and F1-Macro 
scores on the 10k MSDA and IADD Datasets. 

                 MSDA           IADD 
Model            Acc.   F1      Acc.   F1 
CamelBERT-mix    81.4   75.8    99.4   96.9 
DBERT-mix        76.1   70.9    98.9   93.9 
MARBERT          82.7   78.0    99.5   97.4 
mBERT            76.4   70.8    99.1   94.8 
XLM-R            69.0   62.7    98.2   90.4 
MorrBERT         75.1   67.0    99.2   95.3 
MorRoBERTa       78.2   72.8    99.6   98.6 


6 Conclusion 


This article outlines our proposed process for creating a 
large Moroccan Dialect dataset as well as our approach 
for pre-training two BERT-like models. Our aim was 
to compile a dataset with over six million sentences 
written and shared by Moroccan users on three different 
platforms: Facebook, Twitter, and YouTube. In 
the end, we obtained a dataset of more than 71 million 
tokens, which we standardized using a series of 
preprocessing steps. We used two different tokenizers 
(a Byte-level BPE tokenizer for MorRoBERTa and a 
WordPiece tokenizer for MorrBERT), and both models 
were trained using MLM pre-training. To build 
the models, we utilized the HuggingFace Transformers 
API, and we executed them on the HPC-MARWAN 
platform using GPU cards for accelerated performance. 

The models exhibited different specifications and 
training times, highlighting the importance of selecting 
the appropriate configuration for a model based on the 
task at hand and the computational resources available 
for training. Through our pre-training process, we de- 
veloped two robust models for Moroccan Dialect, and 
we hope that this work will contribute to the growing 
body of research on NLP for low-resource languages. 

Our models are publicly available for research 
purposes via the Hugging Face repository³. 


³ https://huggingface.co/otmangi 


Table 8: Language Classification accuracy and F1-Macro 
scores on our Dataset of Facebook Comments. 

                 Our Data 
Model            Acc.   F1 
CamelBERT-mix    93.9   87.7 
DBERT-mix        94.0   87.7 
MARBERT          94.1   88.0 
mBERT            93.5   88.0 
XLM-R            93.1   87.1 
MorrBERT         95.2   91.1 
MorRoBERTa       94.9   90.3 